31 research outputs found
Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise
Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should be facilitated by their similarity. However, current approaches that operate in the embedding space do not take surface similarity into account. This work presents a simple yet effective strategy to improve cross-lingual transfer between closely related varieties. We propose to augment the data of the high-resource source language with character-level noise to make the model more robust towards spelling variations. Our strategy shows consistent improvements over several languages and tasks: Zero-shot transfer of POS tagging and topic identification between language varieties from the Finnic, West and North Germanic, and Western Romance language branches. Our work provides evidence for the usefulness of simple surface-level noise in improving transfer between language varieties.
Comment: ACL 202
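The abstract does not specify the exact noise operations or rates; a minimal sketch of character-level noise injection, assuming uniform random deletion, substitution, and insertion (the function name `inject_char_noise` and the per-character rate are illustrative assumptions, not the paper's implementation):

```python
import random

def inject_char_noise(text, noise_prob=0.1, seed=None):
    """Randomly delete, substitute, or insert characters.

    Hypothetical sketch of character-level noise augmentation; the
    paper's exact operations and rates may differ.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        r = rng.random()
        if r < noise_prob / 3:
            continue                          # deletion: drop this character
        elif r < 2 * noise_prob / 3:
            out.append(rng.choice(alphabet))  # substitution: replace it
        elif r < noise_prob:
            out.append(ch)
            out.append(rng.choice(alphabet))  # insertion: add a random char after it
        else:
            out.append(ch)                    # keep unchanged
    return "".join(out)
```

Applied to the source-language training data, such perturbations mimic the spelling variation found in closely related varieties, so the model learns representations less sensitive to surface form.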
Reducing Gender Bias in NMT with FUDGE
Gender bias appears in many neural machine translation (NMT) models and commercial translation software. Research has become more aware of this problem in recent years, and there has been work on mitigating gender bias; however, the challenge of addressing gender bias in NMT persists. This work utilizes a controlled text generation method, Future Discriminators for Generation (FUDGE), to reduce the so-called Speaking As gender bias. This bias emerges when translating from English to a language that openly marks the gender of the speaker. We evaluate the model on MuST-SHE, a challenge set designed specifically to evaluate gender translation. The results demonstrate improvements in the translation accuracy of the feminine terms.
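FUDGE decodes from p(x_t | x_<t, a) ∝ p(x_t | x_<t) · p(a | x_≤t): at each step, every candidate token's log-probability under the base model is shifted by a discriminator's log-probability that the desired attribute (here, feminine speaker gender) will hold if that token is chosen. A toy sketch of this rescoring step with plain floats, not a real model (the function name `fudge_rescore` is illustrative):

```python
import math

def fudge_rescore(base_logprobs, attr_logprobs):
    """Combine base LM log-probs with attribute-discriminator log-probs.

    base_logprobs: per-candidate log p(x_t | x_<t) from the base model.
    attr_logprobs: per-candidate log p(a | x_<=t) from the discriminator.
    Returns the renormalized log-probs of the combined distribution.
    """
    combined = [b + a for b, a in zip(base_logprobs, attr_logprobs)]
    # renormalize so the result is a proper distribution
    z = math.log(sum(math.exp(c) for c in combined))
    return [c - z for c in combined]
```

Because the discriminator only reweights the base model's distribution at decoding time, the NMT model itself needs no retraining.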
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly
monotonic in the alignment between source and target sequence, and previous
work has facilitated or enforced learning of monotonic attention behavior via
specialized attention functions or pretraining. In this work, we introduce a
monotonicity loss function that is compatible with standard attention
mechanisms and test it on several sequence-to-sequence tasks:
grapheme-to-phoneme conversion, morphological inflection, transliteration, and
dialect normalization. Experiments show that we can achieve largely monotonic
behavior. Performance is mixed, with larger gains on top of RNN baselines.
General monotonicity does not benefit transformer multihead attention; however,
we see isolated improvements when only a subset of heads is biased towards
monotonic behavior.
Comment: To be published in: Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies (NAACL-HLT 2021).
Building a Parallel Corpus on the World's Oldest Banking Magazine
We report on our processing steps to build a diachronic parallel corpus based on the world's oldest banking magazine. The magazine has been published since 1895 in German, with translations in French and partly in English and Italian. Our data sources are printed issues (until 1997), PDF issues (since 1998), and HTML files (since 2001). The corpus building poses special challenges in article boundary recognition and in cross-language article and sentence alignment. Our corpus fills a gap in parallel corpora with respect to genre (magazine articles), domain (banking and economy articles), and time span (120 years).
Findings of the VarDial Evaluation Campaign 2022
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2022. The campaign is part of the ninth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2022. Three separate shared tasks were included this year: Identification of Languages and Dialects of Italy (ITDI), French Cross-Domain Dialect Identification (FDI), and Dialectal Extractive Question Answering (DialQA). All three tasks were organized for the first time this year.
Findings of the VarDial Evaluation Campaign 2023
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages — True Labels (DSL-TL), and Discriminating Between Similar Languages — Speech (DSL-S). All three tasks were organized for the first time this year.